INN Hotels Project¶
Context¶
A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Cancelling is often free of charge, or at worst low-cost, which is convenient for guests but a less desirable, potentially revenue-diminishing factor for hotels. Losses are particularly high on last-minute cancellations.
New technologies, particularly online booking channels, have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
- Loss of resources (revenue) when the hotel cannot resell the room.
- Additional distribution-channel costs, such as higher commissions or publicity spending to help resell these rooms.
- Last-minute price reductions to resell the room, which reduce the profit margin.
- Human resources needed to make arrangements for the guests.
Objective¶
The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.
Data Description¶
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
- Booking_ID: unique identifier of each booking
- no_of_adults: Number of adults
- no_of_children: Number of children
- no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- type_of_meal_plan: Type of meal plan booked by the customer:
- Not Selected – No meal plan selected
- Meal Plan 1 – Breakfast
- Meal Plan 2 – Half board (breakfast and one other meal)
- Meal Plan 3 – Full board (breakfast, lunch, and dinner)
- required_car_parking_space: Does the customer require a car parking space? (0 - No, 1 - Yes)
- room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
- lead_time: Number of days between the date of booking and the arrival date
- arrival_year: Year of arrival date
- arrival_month: Month of arrival date
- arrival_date: Date of the month
- market_segment_type: Market segment designation.
- repeated_guest: Is the customer a repeated guest? (0 - No, 1 - Yes)
- no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
- no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
- avg_price_per_room: Average price per day of the reservation, in euros (room prices are dynamic)
- no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
- booking_status: Flag indicating if the booking was canceled or not.
Importing necessary libraries and data¶
# Data Handling
import pandas as pd
import numpy as np
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import plot_tree
# Data preprocessing & Feature Engineering
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
# Machine Learning Models
#from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier # Tree-based models
from sklearn.linear_model import LogisticRegression # Logistic regression for baseline
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier # XGBoost for boosting performance
# Model Selection
from sklearn.model_selection import GridSearchCV
# Model Evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_recall_curve
# For bold text in print statements
from rich.console import Console
from rich import print
console = Console()
Data Overview¶
- Observations
- Sanity checks
df = pd.read_csv('INNHotelsGroup.csv')
df.head(1)
| Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.0 | 0 | Not_Canceled |
Dimensions, Col names & Data types¶
# Check dimensions, column names
console.print(
'[bold]Dataframe Shape:[/bold]\n',df.shape,'\n',
'[bold]Columns:[/bold]\n',df.columns,'\n')
Dataframe Shape: (36275, 19) Columns: Index(['Booking_ID', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved', 'lead_time', 'arrival_year', 'arrival_month', 'arrival_date', 'market_segment_type', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'booking_status'], dtype='object')
# Define colors for each data type
type_colors = {
"object": "green",
"int64": "red",
"float64": "magenta"
}
# Determine the max width for alignment
max_col_width = max(len(column) for column in df.columns) + 2 # Add padding
# Build a single formatted string with alignment
output = "\n".join(
f"{column.ljust(max_col_width)} [{type_colors.get(str(dtype), 'white')}] {dtype} [/]"
for column, dtype in zip(df.columns, df.dtypes)
)
# Print everything at once
console.print('[bold]Data Type: [/bold]\n',output)
Data Type: Booking_ID object no_of_adults int64 no_of_children int64 no_of_weekend_nights int64 no_of_week_nights int64 type_of_meal_plan object required_car_parking_space int64 room_type_reserved object lead_time int64 arrival_year int64 arrival_month int64 arrival_date int64 market_segment_type object repeated_guest int64 no_of_previous_cancellations int64 no_of_previous_bookings_not_canceled int64 avg_price_per_room float64 no_of_special_requests int64 booking_status object
Missing Values & Duplicates¶
console.print('[bold]Missing Values:[/bold]\n', df.isnull().sum(),'\n',
'\n[bold]Duplicated Rows:[/bold]', df.duplicated().sum())
Missing Values: Booking_ID 0 no_of_adults 0 no_of_children 0 no_of_weekend_nights 0 no_of_week_nights 0 type_of_meal_plan 0 required_car_parking_space 0 room_type_reserved 0 lead_time 0 arrival_year 0 arrival_month 0 arrival_date 0 market_segment_type 0 repeated_guest 0 no_of_previous_cancellations 0 no_of_previous_bookings_not_canceled 0 avg_price_per_room 0 no_of_special_requests 0 booking_status 0 dtype: int64 Duplicated Rows: 0
Summary Statistics¶
# Numerical Statistics
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.0 | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.00 | 2.0 | 4.0 |
| no_of_children | 36275.0 | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.00 | 0.0 | 10.0 |
| no_of_weekend_nights | 36275.0 | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.00 | 2.0 | 7.0 |
| no_of_week_nights | 36275.0 | 2.204300 | 1.410905 | 0.0 | 1.0 | 2.00 | 3.0 | 17.0 |
| required_car_parking_space | 36275.0 | 0.030986 | 0.173281 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| lead_time | 36275.0 | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.00 | 126.0 | 443.0 |
| arrival_year | 36275.0 | 2017.820427 | 0.383836 | 2017.0 | 2018.0 | 2018.00 | 2018.0 | 2018.0 |
| arrival_month | 36275.0 | 7.423653 | 3.069894 | 1.0 | 5.0 | 8.00 | 10.0 | 12.0 |
| arrival_date | 36275.0 | 15.596995 | 8.740447 | 1.0 | 8.0 | 16.00 | 23.0 | 31.0 |
| repeated_guest | 36275.0 | 0.025637 | 0.158053 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
| no_of_previous_cancellations | 36275.0 | 0.023349 | 0.368331 | 0.0 | 0.0 | 0.00 | 0.0 | 13.0 |
| no_of_previous_bookings_not_canceled | 36275.0 | 0.153411 | 1.754171 | 0.0 | 0.0 | 0.00 | 0.0 | 58.0 |
| avg_price_per_room | 36275.0 | 103.423539 | 35.089424 | 0.0 | 80.3 | 99.45 | 120.0 | 540.0 |
| no_of_special_requests | 36275.0 | 0.619655 | 0.786236 | 0.0 | 0.0 | 0.00 | 1.0 | 5.0 |
# Categorical Statistics
df.describe(include='object').T
| count | unique | top | freq | |
|---|---|---|---|---|
| Booking_ID | 36275 | 36275 | INN36275 | 1 |
| type_of_meal_plan | 36275 | 4 | Meal Plan 1 | 27835 |
| room_type_reserved | 36275 | 7 | Room_Type 1 | 28130 |
| market_segment_type | 36275 | 5 | Online | 23214 |
| booking_status | 36275 | 2 | Not_Canceled | 24390 |
df.head()
| Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
Exploratory Data Analysis (EDA)¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Histogram, Boxplot & Correlation Heatmap¶
# Histogram to see the distribution of numerical features
df.hist(figsize=(12, 12), bins=20, color='skyblue')
plt.show()
# Select only numerical columns
numerical_columns = df.select_dtypes(include=[np.number]).columns
# Define the number of rows and columns for the grid
n_cols = 3 # Number of columns in the grid
n_rows = int(np.ceil(len(numerical_columns) / n_cols)) # Calculate the number of rows needed
# Create a figure and axes
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(15, n_rows * 5))
# Flatten the axes array for easy iteration
axes = axes.flatten()
# Plot each numerical column as a boxplot
for ax, column in zip(axes, numerical_columns):
sns.boxplot(y=df[column], ax=ax)
ax.set_title(f'Boxplot of {column}')
# Remove any empty subplots
for i in range(len(numerical_columns), len(axes)):
fig.delaxes(axes[i])
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()
Note:
To include boolean variables in the heatmap, some data types must be changed
# dtype change to bool
df['required_car_parking_space'] = df['required_car_parking_space'].astype(bool)
df['repeated_guest'] = df['repeated_guest'].astype(bool)
df['booking_status'] = df['booking_status'].map({'Canceled': 1, 'Not_Canceled': 0}).astype(bool)
Arrival year, month, and day must be combined into a proper date format
df['arrival_datetime'] = pd.to_datetime(df[['arrival_year', 'arrival_month', 'arrival_date']].rename(columns={
'arrival_year': 'year',
'arrival_month': 'month',
'arrival_date': 'day'
}), errors='coerce')
df[df['arrival_datetime'].isna()][['arrival_year', 'arrival_month', 'arrival_date']].head(2)
| arrival_year | arrival_month | arrival_date | |
|---|---|---|---|
| 2626 | 2018 | 2 | 29 |
| 3677 | 2018 | 2 | 29 |
# Drop columns for year month and date that are no longer needed
df.drop(columns=['arrival_year', 'arrival_month', 'arrival_date'], inplace=True)
Note:
Some bookings fall on dates that do not exist (e.g., February 29, 2018), which is why the datetime conversion produced missing values
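This behavior of `pd.to_datetime` with `errors='coerce'` can be verified on a small toy frame (the `sample` data below is illustrative, not from the hotel dataset):

```python
import pandas as pd

# Toy example: February 29, 2018 does not exist, so to_datetime yields NaT
sample = pd.DataFrame({
    'year':  [2018, 2018],
    'month': [2, 3],
    'day':   [29, 15],
})
dates = pd.to_datetime(sample, errors='coerce')
print(dates.isna().sum())  # → 1 impossible date
```

Counting the resulting `NaT` values is a quick way to see how many rows are affected before deciding to drop them.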
numerical_bool_df = df.select_dtypes(include=[np.number,bool, 'datetime64']).copy()
# Correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(numerical_bool_df.corr(), annot=True, cmap="coolwarm", annot_kws={"size": 8})
plt.show()
Analysis:
lead_time (0.44): Longer lead times (time between booking and arrival) are positively correlated with cancellations.
no_of_special_requests (-0.25): The negative correlation suggests that more special requests are associated with fewer cancellations.
repeated_guest (-0.12): Repeated guests are less likely to cancel.
no_of_previous_bookings_not_canceled (-0.11): More previous successful bookings indicate a lower likelihood of cancellations.
required_car_parking_space (-0.086): Guests needing parking spaces are slightly less likely to cancel.
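The readings above can also be pulled out programmatically by sorting the target column of the correlation matrix. A minimal sketch, using a toy frame in place of `numerical_bool_df` (with `target` standing in for `booking_status`; the data here is synthetic):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: longer lead times push the target toward 1, noise is unrelated
rng = np.random.default_rng(1)
n = 500
lead = rng.integers(0, 300, n).astype(float)
target = (lead + rng.normal(0, 120, n) > 150).astype(int)
toy = pd.DataFrame({'lead_time': lead, 'noise': rng.normal(size=n), 'target': target})

# Sort features by their correlation with the target, as read off the heatmap
corrs = toy.corr()['target'].drop('target').sort_values(ascending=False)
print(corrs)
```

On the real data, replacing `toy` with `numerical_bool_df` and `'target'` with the booking-status column yields the same ranking the heatmap shows.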
Plot date vs bookings¶
df.rename(columns={'arrival_datetime': 'arrival_date'}, inplace=True)
df['arrival_date'] = pd.to_datetime(df['arrival_date'])
# Group by month and year
monthly_bookings = df.groupby(pd.Grouper(key='arrival_date', freq='ME'))['booking_status'].count()
# Plot
plt.figure(figsize=(10, 6))
monthly_bookings.plot()
plt.title('Number of Bookings Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Bookings')
plt.show()
Leading Questions:
- What are the busiest months in the hotel?
- Which market segment do most of the guests come from?
- Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
- What percentage of bookings are canceled?
- Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
- Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
- What are the busiest months in the hotel?
# Extract month from arrival_date
df['arrival_month'] = df['arrival_date'].dt.month
# Count bookings per month
monthly_bookings = df['arrival_month'].value_counts().sort_index()
# Visualize
monthly_bookings.plot(kind='bar', figsize=(10, 6), color='skyblue')
plt.title("Monthly Bookings")
plt.ylabel("Number of Bookings")
plt.xlabel("Month")
plt.xticks(ticks=range(12), labels=[
'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
])
plt.show()
- Which market segment do most of the guests come from?
# Count bookings per market segment
market_segment_counts = df['market_segment_type'].value_counts()
# Visualize
market_segment_counts.plot(kind='bar', figsize=(10, 6), color='lightcoral')
plt.title("Market Segment Distribution")
plt.ylabel("Number of Guests")
plt.xlabel("Market Segment")
plt.show()
- Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
# Average room price per market segment
avg_price_by_segment = df.groupby('market_segment_type')['avg_price_per_room'].mean()
# Visualize
avg_price_by_segment.plot(kind='bar', figsize=(10, 6), color='mediumseagreen')
plt.title("Average Room Price by Market Segment")
plt.ylabel("Average Room Price")
plt.xlabel("Market Segment")
plt.show()
- What percentage of bookings are canceled?
# Booking status distribution numeric column
df['booking_status_numeric'] = df['booking_status'].astype(int)
# Calculate cancellation rate
cancellation_rate = df['booking_status_numeric'].mean() * 100
print(f"Cancellation Rate: {cancellation_rate:.2f}%")
Cancellation Rate: 32.76%
- Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
# Filter for repeating guests
repeating_guests = df[df['repeated_guest'] == 1]
# Calculate cancellation rate for repeating guests
repeating_guest_cancellation_rate = repeating_guests['booking_status_numeric'].mean() * 100
print(f"Cancellation Rate for Repeating Guests: {repeating_guest_cancellation_rate:.2f}%")
# Filter for non-repeating guests
first_time_guests = df[df['repeated_guest'] == 0]
# Calculate cancellation rate for first-time guests
first_time_guest_cancellation_rate = first_time_guests['booking_status_numeric'].mean() * 100
print(f"Cancellation Rate for First-Time Guests: {first_time_guest_cancellation_rate:.2f}%")
Cancellation Rate for Repeating Guests: 1.72%
Cancellation Rate for First-Time Guests: 33.58%
- Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
# Group by number of special requests and calculate cancellation rate
special_request_cancellation = df.groupby('no_of_special_requests')['booking_status_numeric'].mean() * 100
# Visualize
special_request_cancellation.plot(kind='bar', figsize=(10, 6), color='cornflowerblue')
plt.title("Cancellation Rate by Number of Special Requests")
plt.ylabel("Cancellation Rate (%)")
plt.xlabel("Number of Special Requests")
plt.show()
Data Preprocessing¶
- Missing value treatment (if needed)
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
missing_values = df.isnull().sum()[lambda x: x > 0]
print(missing_values)
df.dropna(inplace=True)
arrival_date 37 arrival_month 37 dtype: int64
Note:
Although the arrival date and month show only a small correlation with cancellation rates, I will still drop the rows with missing values, since these reservations fall on non-existent dates and are not reliable
Feature Engineering¶
# Apply log transformation (add 1 to avoid log(0))
df['lead_time_log'] = np.log1p(df['lead_time'])
df['avg_price_per_room_log'] = np.log1p(df['avg_price_per_room'])
# Total Stay column
df['total_stay'] = df['no_of_weekend_nights'] + df['no_of_week_nights']
# Column for Season
df['season'] = df['arrival_month'].map({
12: 'Winter', 1: 'Winter', 2: 'Winter',
3: 'Spring', 4: 'Spring', 5: 'Spring',
6: 'Summer', 7: 'Summer', 8: 'Summer',
9: 'Fall', 10: 'Fall', 11: 'Fall'
})
df = pd.get_dummies(df, columns=['season'], drop_first=True)
# Indicator for special requests
df['has_special_requests'] = (df['no_of_special_requests'] > 0).astype(int)
One-hot encoding¶
onehot_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
df = pd.get_dummies(df, columns=onehot_cols, drop_first=True)
Outliers detection and handling¶
# Select only numerical columns
numerical_columns = df.select_dtypes(include=[np.number]).columns
# Define the number of rows and columns for the grid
n_cols = 3 # Number of columns in the grid
n_rows = int(np.ceil(len(numerical_columns) / n_cols)) # Calculate the number of rows needed
# Create a figure and axes
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(15, n_rows * 5))
# Flatten the axes array for easy iteration
axes = axes.flatten()
# Plot each numerical column as a boxplot
for ax, column in zip(axes, numerical_columns):
sns.boxplot(y=df[column], ax=ax)
ax.set_title(f'Boxplot of {column}')
# Remove any empty subplots
for i in range(len(numerical_columns), len(axes)):
fig.delaxes(axes[i])
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()
High Relevance Variables¶
- lead_time (high correlation with booking_status_numeric and many outliers)
- no_of_special_requests (moderate correlation)
- avg_price_per_room (important for business insights and has many outliers)
- total_stay (newly created and potentially impactful)
- lead_time likely indicates early planners and should not be removed
- avg_price_per_room indicates luxury bookings and won't be removed
- no_of_adults has values of 0, which should not be possible; these rows will be removed
# Remove no_of_adults with value 0
df = df[df['no_of_adults'] > 0]
EDA¶
- It is a good idea to explore the data once again after manipulating it.
# Histogram to see the distribution of numerical features
df.hist(figsize=(12, 12), bins=20, color='skyblue')
plt.show()
numerical_bool_df = df.select_dtypes(include=[np.number,bool, 'datetime64']).copy()
# Correlation heatmap
plt.figure(figsize=(27, 15))
sns.heatmap(numerical_bool_df.corr(), annot=True, cmap="coolwarm", annot_kws={"size": 8})
plt.show()
# heatmap for only booking_status
plt.figure(figsize=(4,8))
sns.heatmap(numerical_bool_df.corr()[['booking_status_numeric']].sort_values(by='booking_status_numeric', ascending=False),
annot=True, cmap="coolwarm", annot_kws={"size": 6})
plt.yticks(fontsize=6)
plt.show()
Checking Multicollinearity¶
- In order to make statistical inferences from a logistic regression model, it is important to ensure that there is no multicollinearity present in the data.
# VIF is a measure of multicollinearity among the independent variables within a regression model.
# this will help us to identify the features that are highly correlated with each other
# and ensure there is no multicollinearity in the data
from statsmodels.stats.outliers_influence import variance_inflation_factor
Steps to Test for Multicollinearity Using VIF
- Import Required Libraries: use statsmodels to calculate VIF values for each feature.
- Calculate VIF: VIF values are calculated for numerical features (including dummy variables), not for categorical variables directly.
- Interpret VIF Values:
  - VIF < 5: Low multicollinearity (acceptable).
  - VIF 5–10: Moderate multicollinearity (needs attention).
  - VIF > 10: High multicollinearity (problematic; consider removing the feature).
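Since the VIF table is recomputed several times below, the steps can be wrapped in a small helper. A sketch under the same approach (the `compute_vif` function and the toy frame are illustrative, not part of the original notebook):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(X: pd.DataFrame) -> pd.DataFrame:
    """Return a VIF table for a numeric feature matrix."""
    values = X.astype(float).values
    return pd.DataFrame({
        'feature': X.columns,
        'VIF': [variance_inflation_factor(values, i) for i in range(X.shape[1])],
    })

# Toy example: x2 is (almost) a multiple of x1, so both get high VIFs; x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    'x1': x1,
    'x2': 2 * x1 + rng.normal(scale=0.1, size=200),
    'x3': rng.normal(size=200),
})
vif_table = compute_vif(toy)
print(vif_table)
```

Calling `compute_vif(X)` after each column drop avoids repeating the three-line boilerplate in every cell.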
# Dropping perfect collinearity columns
df.drop(columns={'booking_status_numeric'}, inplace=True)
# drop id column
df.drop(columns={'Booking_ID'}, inplace=True)
# drop week and weekend nights in place of total stay
df.drop(columns={'no_of_weekend_nights', 'no_of_week_nights'}, inplace=True)
# Select only numerical features (including dummy variables)
features_to_include = df.select_dtypes(include=['float64', 'int64', 'bool'])
X = features_to_include
# Check data types of the columns
non_numeric_cols = ~X.dtypes.isin(['int64', 'float64', 'bool'])
console.print(non_numeric_cols[non_numeric_cols])
# if nothing is printed, then all columns are numeric or boolean
Series([], dtype: bool)
# Select only numerical features (including dummy variables)
features_to_include = df.select_dtypes(include=['float64', 'int64', 'bool'])
X = features_to_include
# Check if any column contains non-numeric data
non_numeric_cols = X.select_dtypes(exclude=['float64', 'int64', 'bool']).columns
print("Non-Numeric Columns:", non_numeric_cols)
X = X.apply(pd.to_numeric, errors='coerce')
bool_df = X.select_dtypes(include=[bool])
# Convert boolean columns to integers
X[bool_df.columns] = X[bool_df.columns].astype(int)
# Initial VIF computation (missing and infinite values are checked below)
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
Non-Numeric Columns: Index([], dtype='object')
# Check for constant columns
constant_columns = [col for col in X.columns if X[col].nunique() == 1]
print("Constant Columns:", constant_columns)
# Drop constant columns
X = X.drop(columns=constant_columns)
Constant Columns:
[]
# Check for highly correlated features (perfect collinearity)
corr_matrix = X.corr().abs()
high_corr_pairs = [(i, j) for i in corr_matrix.columns for j in corr_matrix.columns
if i != j and corr_matrix.loc[i, j] == 1]
print("Perfectly Correlated Features:", high_corr_pairs)
# Drop one of the correlated features (Example: Keep one room type)
if high_corr_pairs:
X = X.drop(columns=[pair[1] for pair in high_corr_pairs])
Perfectly Correlated Features:
[]
# Check for NaN values
print("Missing values:", X.isnull().sum().sum())
# Replace NaNs with median or drop rows
X = X.fillna(X.median()) # Alternative: X.dropna()
# Check that all values are finite (np.isfinite(...).all() is True when there are no infinities)
print("All values finite:", np.isfinite(X.values).all())
# Replace any infinite values with NaN and drop them
X = X.replace([np.inf, -np.inf], np.nan).dropna()
Missing values: 0
All values finite: True
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data) # Display final VIF values
feature VIF 0 no_of_adults 19.568398 1 no_of_children 2.200580 2 required_car_parking_space 1.077707 3 lead_time 6.856512 4 repeated_guest 1.814900 5 no_of_previous_cancellations 1.362887 6 no_of_previous_bookings_not_canceled 1.623736 7 avg_price_per_room 44.291580 8 no_of_special_requests 6.756292 9 booking_status 2.308347 10 arrival_month 13.510787 11 lead_time_log 25.272502 12 avg_price_per_room_log 200.469494 13 total_stay 4.365344 14 season_Spring 3.138926 15 season_Summer 2.310281 16 season_Winter 1.854382 17 has_special_requests 7.489517 18 type_of_meal_plan_Meal Plan 2 1.342316 19 type_of_meal_plan_Meal Plan 3 1.018499 20 type_of_meal_plan_Not Selected 1.462514 21 room_type_reserved_Room_Type 2 1.048836 22 room_type_reserved_Room_Type 3 1.002365 23 room_type_reserved_Room_Type 4 1.662713 24 room_type_reserved_Room_Type 5 1.040740 25 room_type_reserved_Room_Type 6 2.180334 26 room_type_reserved_Room_Type 7 1.132211 27 market_segment_type_Complementary 1.429518 28 market_segment_type_Corporate 7.621751 29 market_segment_type_Offline 36.422360 30 market_segment_type_Online 78.104797
# Issue: avg_price_per_room_log (200.47) vs. avg_price_per_room (44.29)
X = X.drop(columns=['avg_price_per_room'])
# Issue: market_segment_type_Online (78.10) & market_segment_type_Offline (36.42)
X = X.drop(columns=['market_segment_type_Offline'])
# Issue: lead_time_log (25.27)
X = X.drop(columns=['lead_time'])
# just a binary version of no_of_special_requests
X = X.drop(columns=['has_special_requests'])
- Note:
After making these changes rerun VIF to confirm improvements
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Display VIF values
print(vif_data)
feature VIF 0 no_of_adults 18.778147 1 no_of_children 2.153121 2 required_car_parking_space 1.072586 3 repeated_guest 1.802609 4 no_of_previous_cancellations 1.362596 5 no_of_previous_bookings_not_canceled 1.623170 6 no_of_special_requests 2.280386 7 booking_status 2.065711 8 arrival_month 12.710654 9 lead_time_log 10.710983 10 avg_price_per_room_log 34.539206 11 total_stay 4.262850 12 season_Spring 2.961666 13 season_Summer 2.270895 14 season_Winter 1.628948 15 type_of_meal_plan_Meal Plan 2 1.257353 16 type_of_meal_plan_Meal Plan 3 1.018309 17 type_of_meal_plan_Not Selected 1.420447 18 room_type_reserved_Room_Type 2 1.046752 19 room_type_reserved_Room_Type 3 1.002255 20 room_type_reserved_Room_Type 4 1.494297 21 room_type_reserved_Room_Type 5 1.023855 22 room_type_reserved_Room_Type 6 1.998473 23 room_type_reserved_Room_Type 7 1.082093 24 market_segment_type_Complementary 1.348934 25 market_segment_type_Corporate 1.589260 26 market_segment_type_Online 4.995929
# arrival_month is redundant with the season dummies and might not add much value
X = X.drop(columns=['arrival_month'])
# no_of_adults and no_of_children might be highly correlated, so create a combined feature and drop the originals
X['total_guests'] = X['no_of_adults'] + X['no_of_children']
X = X.drop(columns=['no_of_adults', 'no_of_children'])
# Recalculate VIF
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Display updated VIF values
print(vif_data)
feature VIF 0 required_car_parking_space 1.072318 1 repeated_guest 1.796372 2 no_of_previous_cancellations 1.357731 3 no_of_previous_bookings_not_canceled 1.622594 4 no_of_special_requests 2.279328 5 booking_status 2.046587 6 lead_time_log 10.132456 7 avg_price_per_room_log 22.811984 8 total_stay 4.246001 9 season_Spring 1.630292 10 season_Summer 1.830723 11 season_Winter 1.481123 12 type_of_meal_plan_Meal Plan 2 1.256882 13 type_of_meal_plan_Meal Plan 3 1.017902 14 type_of_meal_plan_Not Selected 1.399529 15 room_type_reserved_Room_Type 2 1.044558 16 room_type_reserved_Room_Type 3 1.002159 17 room_type_reserved_Room_Type 4 1.448029 18 room_type_reserved_Room_Type 5 1.023324 19 room_type_reserved_Room_Type 6 1.341422 20 room_type_reserved_Room_Type 7 1.067688 21 market_segment_type_Complementary 1.201069 22 market_segment_type_Corporate 1.577327 23 market_segment_type_Online 4.983414 24 total_guests 16.163514
- Note:
The VIF of the average price per room is still quite high
Since it is an essential feature, another transformation will be tested (square root instead of log)
df['avg_price_per_room'] = np.expm1(df['avg_price_per_room_log']) # Revert log1p
df['avg_price_per_room_sqrt'] = np.sqrt(df['avg_price_per_room'])
X = X.drop(columns=['avg_price_per_room_log']) # Remove log version
X['avg_price_per_room_sqrt'] = df['avg_price_per_room_sqrt'] # Add square root version
# Recalculate VIF
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Display updated VIF values
print(vif_data)
feature VIF 0 required_car_parking_space 1.073215 1 repeated_guest 1.790544 2 no_of_previous_cancellations 1.357719 3 no_of_previous_bookings_not_canceled 1.622781 4 no_of_special_requests 2.279911 5 booking_status 2.040480 6 lead_time_log 9.266992 7 total_stay 4.185891 8 season_Spring 1.614460 9 season_Summer 1.830182 10 season_Winter 1.464924 11 type_of_meal_plan_Meal Plan 2 1.278025 12 type_of_meal_plan_Meal Plan 3 1.017897 13 type_of_meal_plan_Not Selected 1.390154 14 room_type_reserved_Room_Type 2 1.045731 15 room_type_reserved_Room_Type 3 1.002190 16 room_type_reserved_Room_Type 4 1.455880 17 room_type_reserved_Room_Type 5 1.026627 18 room_type_reserved_Room_Type 6 1.307934 19 room_type_reserved_Room_Type 7 1.065076 20 market_segment_type_Complementary 1.197903 21 market_segment_type_Corporate 1.535639 22 market_segment_type_Online 5.104850 23 total_guests 16.134442 24 avg_price_per_room_sqrt 20.202305
avg_price_per_room_sqrt still has a high VIF (20.20)
Convert it to a z-score (mean 0, variance 1):
X['avg_price_per_room_std'] = (X['avg_price_per_room_sqrt'] - X['avg_price_per_room_sqrt'].mean()) / X['avg_price_per_room_sqrt'].std()
X = X.drop(columns=['avg_price_per_room_sqrt']) # Drop original transformed column
total_guests still has a high VIF (16.13)
It may still be correlated with room type or market segment.
Center it by subtracting the mean:
X['total_guests_centered'] = X['total_guests'] - X['total_guests'].mean()
X = X.drop(columns=['total_guests'])
# Recalculate VIF after dropping/modifying final high-VIF features
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Display updated VIF values
print(vif_data)
feature VIF 0 required_car_parking_space 1.074273 1 repeated_guest 1.799452 2 no_of_previous_cancellations 1.357548 3 no_of_previous_bookings_not_canceled 1.621802 4 no_of_special_requests 2.282089 5 booking_status 2.079970 6 lead_time_log 5.890001 7 total_stay 4.026819 8 season_Spring 1.544822 9 season_Summer 1.748815 10 season_Winter 1.486905 11 type_of_meal_plan_Meal Plan 2 1.251314 12 type_of_meal_plan_Meal Plan 3 1.018011 13 type_of_meal_plan_Not Selected 1.397751 14 room_type_reserved_Room_Type 2 1.045835 15 room_type_reserved_Room_Type 3 1.002165 16 room_type_reserved_Room_Type 4 1.529099 17 room_type_reserved_Room_Type 5 1.031070 18 room_type_reserved_Room_Type 6 1.445345 19 room_type_reserved_Room_Type 7 1.085403 20 market_segment_type_Complementary 1.666606 21 market_segment_type_Corporate 1.458023 22 market_segment_type_Online 4.438248 23 avg_price_per_room_std 2.215886 24 total_guests_centered 1.577906
Final Heatmap analysis¶
corr_matrix = X.corr()
# Plot heatmap
plt.figure(figsize=(15, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5, annot_kws={"size": 8})
plt.title("Updated Correlation Heatmap (After VIF Analysis)")
plt.show()
final_variables = ['lead_time_log', 'no_of_special_requests', 'avg_price_per_room_std', 'repeated_guest',
'total_guests_centered', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
'market_segment_type_Online', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 6',
'total_stay', 'season_Spring', 'season_Summer', 'season_Winter']
Building a Logistic Regression model¶
fig, ax = plt.subplots(figsize = (3, 3))
sns.barplot(
data = (X
.groupby(['booking_status'])
.size()
.reset_index(name = 'n_customers')),
x = 'booking_status',
y = 'n_customers'
)
plt.title("Booking status distribution")
plt.show()
df = X.copy()
y = df['booking_status']
X = df[final_variables]
# Check the class distribution
print(y.value_counts(normalize=True)) # Percentage distribution
booking_status
0    0.672179
1    0.327821
Name: proportion, dtype: float64
- Note:
This is a moderate imbalance and doesn't require aggressive balancing techniques like downsampling or SMOTE. Using stratification during the train-test split should be sufficient, since the minority class makes up about 33% of the data, above the 20%-30% range usually associated with severe imbalance.
# Split the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Stratify to maintain class balance
# Check class distribution in the splits
print("Training set class distribution:")
print(y_train.value_counts(normalize=True))
print("\nTesting set class distribution:")
print(y_test.value_counts(normalize=True))
Training set class distribution:
booking_status
0    0.672184
1    0.327816
Name: proportion, dtype: float64
Testing set class distribution:
booking_status
0    0.672161
1    0.327839
Name: proportion, dtype: float64
- Note:
Class proportions are maintained across both splits.
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize Logistic Regression model with class weights for imbalance
log_model = LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000)
# Train the model
log_model.fit(X_train_scaled, y_train)
LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
# Make predictions on the test set
y_pred = log_model.predict(X_test_scaled)
Model performance evaluation¶
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Confusion matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
Accuracy: 0.75
Classification Report:
precision recall f1-score support
0 0.87 0.74 0.80 4853
1 0.60 0.78 0.68 2367
accuracy 0.75 7220
macro avg 0.73 0.76 0.74 7220
weighted avg 0.78 0.75 0.76 7220
Confusion Matrix:
[[3594 1259]
 [ 517 1850]]
coefficients = pd.DataFrame({
'Feature': X.columns,
'Coefficient': log_model.coef_[0]
}).sort_values(by='Coefficient', ascending=False)
print("\nFeature Coefficients:")
print(coefficients)
Feature Coefficients:
                                 Feature  Coefficient
0                          lead_time_log     1.287630
7             market_segment_type_Online     0.754744
2                 avg_price_per_room_std     0.616135
5           no_of_previous_cancellations     0.102900
10                            total_stay     0.094150
4                  total_guests_centered    -0.003087
12                         season_Summer    -0.023359
11                         season_Spring    -0.041543
6   no_of_previous_bookings_not_canceled    -0.046954
9         room_type_reserved_Room_Type 6    -0.077767
8         room_type_reserved_Room_Type 4    -0.128675
13                         season_Winter    -0.238586
3                         repeated_guest    -0.311251
1                 no_of_special_requests    -1.159303
- Performance

Overall Performance
- Accuracy: 75%. The model correctly predicts booking status (canceled or not) in 3 out of 4 cases.

Performance on Non-Canceled Bookings (Class 0)
- Precision: 87%. When the model predicts a booking is not canceled, it is correct 87% of the time.
- Recall: 74%. The model correctly identifies 74% of non-canceled bookings.

Performance on Canceled Bookings (Class 1)
- Precision: 60%. When the model predicts a booking is canceled, it is correct 60% of the time.
- Recall: 78%. The model successfully identifies 78% of canceled bookings, showing good sensitivity to the minority class.

F1-Score
- Class 0: 0.80 (good balance between precision and recall for non-canceled bookings).
- Class 1: 0.68 (moderate balance for canceled bookings).
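As a sanity check, every figure in the report can be re-derived by hand from the confusion matrix (rows are actual classes, columns are predicted classes):

```python
# Re-derive the reported metrics from the confusion matrix [[3594 1259], [517 1850]].
tn, fp, fn, tp = 3594, 1259, 517, 1850

accuracy = (tp + tn) / (tn + fp + fn + tp)                     # 0.754 -> reported as 0.75
precision_1 = tp / (tp + fp)                                   # 0.595 -> reported as 0.60
recall_1 = tp / (tp + fn)                                      # 0.782 -> reported as 0.78
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # 0.676 -> reported as 0.68

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```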
Hyperparameter Test¶
param_grid = {
'C': [0.01, 0.1, 1, 10, 100], # Regularization strength
'penalty': ['l2'], # Type of regularization
'solver': ['lbfgs', 'liblinear', 'saga'], # Optimization solver
}
log_model = LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000)
grid_search = GridSearchCV(
estimator=log_model,
param_grid=param_grid,
scoring='f1_weighted', # Use F1 score to balance precision and recall
cv=5, # 5-fold cross-validation
verbose=2, # Print progress
n_jobs=-1 # Use all available CPUs
)
grid_search.fit(X_train_scaled, y_train)
Fitting 5 folds for each of 15 candidates, totalling 75 fits
GridSearchCV(cv=5,
estimator=LogisticRegression(class_weight='balanced',
max_iter=1000, random_state=42),
n_jobs=-1,
param_grid={'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l2'],
'solver': ['lbfgs', 'liblinear', 'saga']},
             scoring='f1_weighted', verbose=2)
# Print the best parameters and corresponding score
print("Best Parameters:", grid_search.best_params_)
print("Best F1 Score:", grid_search.best_score_)
Best Parameters: {'C': 1, 'penalty': 'l2', 'solver': 'lbfgs'}
Best F1 Score: 0.7591421641875913
best_log_model = grid_search.best_estimator_
best_log_model.fit(X_train_scaled, y_train)
LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=42)
# Make predictions on the test set
y_pred = best_log_model.predict(X_test_scaled)
Model Evaluation¶
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))
# Confusion matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
Accuracy: 0.75
Classification Report:
precision recall f1-score support
0 0.87 0.74 0.80 4853
1 0.60 0.78 0.68 2367
accuracy 0.75 7220
macro avg 0.73 0.76 0.74 7220
weighted avg 0.78 0.75 0.76 7220
Confusion Matrix:
[[3594 1259]
 [ 517 1850]]
# Extract coefficients
coefficients = pd.DataFrame({
'Feature': X.columns,
'Coefficient': best_log_model.coef_[0]
}).sort_values(by='Coefficient', ascending=False)
print("\nFeature Coefficients:")
print(coefficients)
Feature Coefficients:
                                 Feature  Coefficient
0                          lead_time_log     1.287630
7             market_segment_type_Online     0.754744
2                 avg_price_per_room_std     0.616135
5           no_of_previous_cancellations     0.102900
10                            total_stay     0.094150
4                  total_guests_centered    -0.003087
12                         season_Summer    -0.023359
11                         season_Spring    -0.041543
6   no_of_previous_bookings_not_canceled    -0.046954
9         room_type_reserved_Room_Type 6    -0.077767
8         room_type_reserved_Room_Type 4    -0.128675
13                         season_Winter    -0.238586
3                         repeated_guest    -0.311251
1                 no_of_special_requests    -1.159303
Model Performance Summary
Accuracy: 75%
Best F1 Score: 0.759
Good recall for cancellations (78%), meaning the model correctly identifies most canceled bookings.
Warning: precision for cancellations is lower (60%), leading to 1,259 false positives (non-canceled bookings wrongly classified as canceled).
Key Feature Insights
Increases Cancellations:
Longer lead time (+1.29)
Online bookings (+0.75)
Higher room price (+0.62)
Reduces Cancellations:
More special requests (-1.16)
Loyal (repeated) guests (-0.31)
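Because logistic regression coefficients are log-odds, exponentiating them turns each one into an odds ratio, which is usually easier to communicate to stakeholders. A short sketch using a few coefficients copied from the table above (rounded to four decimals, so the outputs are illustrative):

```python
import math

# Coefficients from the fitted model above (log-odds of cancellation per unit change).
coefs = {
    'lead_time_log': 1.2876,
    'market_segment_type_Online': 0.7547,
    'avg_price_per_room_std': 0.6161,
    'repeated_guest': -0.3113,
    'no_of_special_requests': -1.1593,
}

# Odds ratio: multiplicative change in the odds of cancellation per unit increase.
# e.g. each additional special request multiplies the odds by roughly 0.31.
for name, b in coefs.items():
    print(f"{name}: odds ratio = {math.exp(b):.2f}")
```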
Hyperparameter Adjustment¶
# Get predicted probabilities for the positive class (canceled bookings)
y_prob = best_log_model.predict_proba(X_test_scaled)[:, 1]
# Define thresholds to test
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
for threshold in thresholds:
y_pred_adjusted = (y_prob >= threshold).astype(int)
print(f"\nThreshold: {threshold}")
print(classification_report(y_test, y_pred_adjusted))
Threshold: 0.3
precision recall f1-score support
0 0.93 0.53 0.68 4853
1 0.49 0.92 0.64 2367
accuracy 0.66 7220
macro avg 0.71 0.73 0.66 7220
weighted avg 0.79 0.66 0.66 7220
Threshold: 0.4
precision recall f1-score support
0 0.91 0.65 0.75 4853
1 0.54 0.87 0.67 2367
accuracy 0.72 7220
macro avg 0.73 0.76 0.71 7220
weighted avg 0.79 0.72 0.73 7220
Threshold: 0.5
precision recall f1-score support
0 0.87 0.74 0.80 4853
1 0.60 0.78 0.68 2367
accuracy 0.75 7220
macro avg 0.73 0.76 0.74 7220
weighted avg 0.78 0.75 0.76 7220
Threshold: 0.6
precision recall f1-score support
0 0.84 0.82 0.83 4853
1 0.65 0.68 0.66 2367
accuracy 0.77 7220
macro avg 0.74 0.75 0.75 7220
weighted avg 0.78 0.77 0.77 7220
Threshold: 0.7
precision recall f1-score support
0 0.81 0.91 0.86 4853
1 0.75 0.57 0.65 2367
accuracy 0.80 7220
macro avg 0.78 0.74 0.75 7220
weighted avg 0.79 0.80 0.79 7220
# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
# Plot the curve
plt.figure(figsize=(8, 6))
plt.plot(thresholds, precision[:-1], label="Precision", linestyle="--")
plt.plot(thresholds, recall[:-1], label="Recall")
plt.xlabel("Decision Threshold")
plt.ylabel("Score")
plt.legend()
plt.title("Precision-Recall Tradeoff")
plt.show()
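Rather than eyeballing the plot, the same trade-off can be scanned programmatically by scoring F1 at every candidate threshold. A standalone sketch with synthetic stand-ins for y_test and y_prob (the f1_at helper and the data are illustrative, not from the notebook):

```python
import numpy as np

# Synthetic stand-ins: for the real model, use the y_test / y_prob arrays above.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.5 * y_true + 0.25 * rng.random(500) + 0.25 * rng.random(500), 0, 1)

def f1_at(threshold):
    """F1 for the positive class when predicting 1 at or above `threshold`."""
    y_pred = (y_score >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=f1_at)
print(f"best threshold by F1: {best:.2f} (F1 = {f1_at(best):.3f})")
```

On the real probabilities this sweep would surface the same trade-off as the table above, where 0.4 favors recall and 0.7 favors precision.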
optimal_threshold = 0.4 # Adjust this based on your results
y_pred_final = (y_prob >= optimal_threshold).astype(int)
# Evaluate final model with new threshold
print("\n🔹 Final Model Evaluation (Threshold Adjusted):")
print(classification_report(y_test, y_pred_final))
print(confusion_matrix(y_test, y_pred_final))
🔹 Final Model Evaluation (Threshold Adjusted):
precision recall f1-score support
0 0.91 0.65 0.75 4853
1 0.54 0.87 0.67 2367
accuracy 0.72 7220
macro avg 0.73 0.76 0.71 7220
weighted avg 0.79 0.72 0.73 7220
[[3135 1718]
 [ 319 2048]]
Recommendation
Choose a Threshold Based on Business Goals:
- Focus on Recall (e.g., 0.3) if detecting as many cancellations as possible is critical (e.g., to mitigate revenue loss).
- Focus on Precision (e.g., 0.7) if you want to avoid false positives (incorrectly flagging non-canceled bookings).
- A Balanced Approach (e.g., 0.4 or 0.5) provides a good trade-off.

A threshold of 0.4 is chosen to maximize the number of cancellations detected (high recall for class 1) while keeping precision at an acceptable level.
threshold = 0.4
y_pred_final = (y_prob >= threshold).astype(int)
print("\nFinal Model Evaluation (Threshold = 0.4):")
print(classification_report(y_test, y_pred_final))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_final))
Final Model Evaluation (Threshold = 0.4):
precision recall f1-score support
0 0.91 0.65 0.75 4853
1 0.54 0.87 0.67 2367
accuracy 0.72 7220
macro avg 0.73 0.76 0.71 7220
weighted avg 0.79 0.72 0.73 7220
Confusion Matrix:
[[3135 1718]
 [ 319 2048]]
# Extract coefficients
coefficients = pd.DataFrame({
'Feature': X.columns,
'Coefficient': best_log_model.coef_[0]
}).sort_values(by='Coefficient', ascending=False)
print("\nFeature Coefficients (Importance):")
print(coefficients)
Feature Coefficients (Importance):
                                 Feature  Coefficient
0                          lead_time_log     1.287630
7             market_segment_type_Online     0.754744
2                 avg_price_per_room_std     0.616135
5           no_of_previous_cancellations     0.102900
10                            total_stay     0.094150
4                  total_guests_centered    -0.003087
12                         season_Summer    -0.023359
11                         season_Spring    -0.041543
6   no_of_previous_bookings_not_canceled    -0.046954
9         room_type_reserved_Room_Type 6    -0.077767
8         room_type_reserved_Room_Type 4    -0.128675
13                         season_Winter    -0.238586
3                         repeated_guest    -0.311251
1                 no_of_special_requests    -1.159303
import joblib
joblib.dump(best_log_model, 'logistic_regression_model.pkl')
['logistic_regression_model.pkl']
Final Model Summary¶
The logistic regression model was developed to predict booking cancellations for INN Hotels Group. With a threshold of 0.4, the model achieved an accuracy of 72%, with a recall of 87% for canceled bookings and a precision of 54%. This ensures most cancellations are correctly identified, aligning with the objective of proactively addressing cancellations.
Key drivers influencing cancellations include longer lead times, online bookings, and higher room prices, all positively correlated with cancellations. Conversely, special requests, repeated guests, and winter bookings reduce the likelihood of cancellations.
Based on these findings, actionable recommendations include implementing stricter refund policies for long lead times and online bookings, encouraging prepayments for high-risk segments, and leveraging customer loyalty through rewards for repeated guests. This model provides a strong foundation for improving INN Hotels' cancellation management strategies.
Building a Decision Tree model¶
# Initialize the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
# Fit the model
dt_model.fit(X_train, y_train)
# Predict on the test set
y_pred = dt_model.predict(X_test)
# Evaluate the model
print("Baseline Decision Tree Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Baseline Decision Tree Model Performance:
Accuracy: 0.86
precision recall f1-score support
0 0.89 0.89 0.89 4853
1 0.78 0.78 0.78 2367
accuracy 0.86 7220
macro avg 0.84 0.84 0.84 7220
weighted avg 0.86 0.86 0.86 7220
Confusion Matrix:
[[4335  518]
 [ 521 1846]]
plt.figure(figsize=(50, 50))
plot_tree(dt_model, feature_names=X.columns, class_names=["Not Canceled", "Canceled"], filled=True)
plt.show()
Hyperparameters¶
# Define the parameter grid
param_grid = {
'max_depth': [3, 5, 10, 15, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 5],
'criterion': ['gini', 'entropy']
}
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
param_grid=param_grid,
scoring='f1',
cv=5,
verbose=1)
# Fit the model
grid_search.fit(X_train, y_train)
# Get the best parameters
print("Best Parameters:", grid_search.best_params_)
best_dt_model = grid_search.best_estimator_
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'criterion': 'gini', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2}
# Predict on the test set
y_pred_tuned = best_dt_model.predict(X_test)
print("Tuned Decision Tree Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_tuned):.2f}")
print(classification_report(y_test, y_pred_tuned))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_tuned))
Tuned Decision Tree Model Performance:
Accuracy: 0.87
precision recall f1-score support
0 0.89 0.91 0.90 4853
1 0.81 0.77 0.79 2367
accuracy 0.87 7220
macro avg 0.85 0.84 0.85 7220
weighted avg 0.86 0.87 0.86 7220
Confusion Matrix:
[[4428  425]
 [ 544 1823]]
# Get feature importance
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Importance': best_dt_model.feature_importances_
}).sort_values(by='Importance', ascending=False)
print("Feature Importance:")
print(feature_importance)
Feature Importance:
                                 Feature  Importance
0                          lead_time_log    0.421241
2                 avg_price_per_room_std    0.204940
7             market_segment_type_Online    0.118506
1                 no_of_special_requests    0.092267
10                            total_stay    0.057346
4                  total_guests_centered    0.039987
13                         season_Winter    0.029093
11                         season_Spring    0.013804
12                         season_Summer    0.013015
8         room_type_reserved_Room_Type 4    0.007780
3                         repeated_guest    0.000782
9         room_type_reserved_Room_Type 6    0.000568
5           no_of_previous_cancellations    0.000525
6   no_of_previous_bookings_not_canceled    0.000147
joblib.dump(best_dt_model, 'decision_tree_model.pkl')
['decision_tree_model.pkl']
Do we need to prune the tree?¶
Yes, pruning was needed: the baseline tree grew to a massive size (visible in the plot above), a sign of overfitting. Constraining max_depth, min_samples_split, and min_samples_leaf in the grid search acted as pre-pruning and kept the final tree to a manageable size.
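The grid search constrains growth up front (pre-pruning). scikit-learn also supports post-pruning via cost-complexity pruning (the ccp_alpha parameter); a sketch on synthetic data, with X_train / y_train to be swapped in for the real model:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with X_train / y_train for the real model.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Candidate alphas come from the cost-complexity pruning path of the unpruned tree.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_demo, y_demo)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # a mid-path alpha, for illustration

pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_demo, y_demo)

print("nodes before pruning:", full_tree.tree_.node_count)
print("nodes after pruning: ", pruned_tree.tree_.node_count)
```

In practice the alpha would be chosen by cross-validation (e.g., adding ccp_alpha to the param_grid above) rather than picked from the middle of the path.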
Model Performance Comparison and Conclusions¶
After evaluating both the Logistic Regression and Decision Tree models, I found that the Decision Tree performs best overall for predicting booking cancellations. With an accuracy of 87%, it outperforms Logistic Regression (72%) and achieves higher precision (81%), meaning it reduces false alarms when identifying cancellations. However, Logistic Regression has a higher recall (87%), making it better at detecting actual cancellations, which could be useful if the priority is to capture as many cancellations as possible. On the other hand, the Decision Tree's slightly lower recall (77%) means it misses more cancellations, but its higher precision ensures that when a booking is predicted as canceled, it is more likely correct. Given these trade-offs, I would recommend the Decision Tree as the best-balanced model, offering strong predictive performance while remaining interpretable.
Final Model: Tuned Decision Tree
Actionable Insights and Recommendations¶
- What profitable policies for cancellations and refunds can the hotel adopt?
- What other recommendations would you suggest to the hotel?
To reduce cancellations and maximize revenue, INN Hotels Group should:
- Implement dynamic refund policies based on lead time: stricter rules for early bookings, more flexibility for last-minute reservations.
- Require non-refundable deposits or offer discounts for non-cancelable rates, since online bookings and higher-priced rooms show higher cancellation rates.
- Encourage direct bookings with exclusive perks, and reward repeat guests or those with special requests with more flexible policies.
- Adopt a strategic overbooking model based on cancellation trends, implement a waitlist system, and adjust seasonal pricing and cancellation policies accordingly.
- Strengthen booking commitment through automated reminders, incentives for keeping bookings, and personalized communication, improving overall guest satisfaction.